[peft] If AutoModel is wrapped with PEFT for prompt learning, then extend the attention mask
#3000
Resolves #2995, resolves huggingface/peft#2154
Hello!
Pull Request overview
Details
Sentence Transformer models are sometimes trained with the `AutoModel` wrapped in PEFT, as that can lower the computation cost of training. In particular, when PEFT with prompt learning is used, virtual tokens (or rather, just `inputs_embeds`) are prepended to the input, and the `attention_mask` is extended accordingly before the base `AutoModel` is called. However, the attention mask used in the Pooling module is then not updated to match, so pooling operates on a mask that is shorter than the hidden states. This PR fixes that.
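For illustration, here is a minimal sketch of the kind of adjustment involved. The helper name is hypothetical and this is not the PR diff itself; it only assumes the standard PEFT attributes (`active_peft_config`, `is_prompt_learning`, `num_virtual_tokens`):

```python
import torch
from peft import PeftModel


def extend_attention_mask_for_prompt_learning(model, attention_mask):
    """Hypothetical helper: prepend mask entries for the virtual tokens
    that PEFT prompt learning adds, so pooling sees a mask whose length
    matches the hidden states."""
    peft_config = getattr(model, "active_peft_config", None)
    if isinstance(model, PeftModel) and peft_config is not None and peft_config.is_prompt_learning:
        # Prompt learning prepends `num_virtual_tokens` embeddings, so the
        # pooling mask must gain the same number of attended positions.
        prefix = torch.ones(
            attention_mask.shape[0],
            peft_config.num_virtual_tokens,
            dtype=attention_mask.dtype,
            device=attention_mask.device,
        )
        attention_mask = torch.cat([prefix, attention_mask], dim=1)
    return attention_mask
```

With mean pooling this matters directly: the virtual-token positions contribute hidden states, so their mask entries must be present for the average to be computed over the right positions.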
Concern
My primary concern now is that the model doesn't seem to be able to train well, regardless of whether I use mean or CLS pooling.
@BenjaminBossan, could you 1) verify that the PR diff looks solid at a glance, and 2) let me know whether a model with this config is expected to train roughly as well as with "full" (non-PEFT) training?